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Machine-designed control of complex devices or experiments can discover strategies superior to 
those developed via simplified models. We describe an online optimization algorithm based on 
Gaussian processes and apply it to optimization of the production of Bose-Einstein condensates 
(BEC). BEC is typically created with an exponential evaporation ramp that is approximately op¬ 
timal for s-wave, ergodic dynamics with two-body interactions and no other loss rates, but likely 
sub-optimal for many real experiments. Machine learning using a Gaussian process, in contrast, 
develops a statistical model of the relationship between the parameters it controls and the quality 
of the BEC produced. This is an online process, and an active one, as the Gaussian process model 
updates on the basis of each subsequent experiment and proposes a new set of parameters as a 
result. We demonstrate that the Gaussian process machine learner is able to discover a ramp that 
produces high quality BECs in 10 times fewer iterations than a previously used online optimization 
technique. Furthermore, we show the internal model developed can be used to determine which 
parameters are essential in BEC creation and which are unimportant, providing insight into the 
optimization process. 


Experimental research into quantum phenomena of¬ 
ten requires the optimization of resources or processes 
in the face of complex underlying dynamics and shifting 
environments. For example, creating large Bose-Einstein 
condensates (BECs) with short duty cycles is one of the 
keys to improving the sensitivity of cold-atom based sen¬ 
sors [T] or for performing scientific investigation into con¬ 
densed matter phases |2], many-body physics [3] and non¬ 
equilibrium dynamics [4]. The standard process of BEC 
production is evaporative cooling [5]; microscopic semi- 
classical theory exists to describe this process [6], but 
it can oversimplify the dynamics and miss more complex 
and effective methods of performing evaporation. For ex¬ 
ample, Shobu et al. [7] found circumventing higher order 
inelastic collisions can produce very large condensates. 
‘Tricks’ like this are likely to exist for other species with 
complicated scattering processes [8j, but discovery is only 
possible by experimentation. We automate this process 
of discovery with machine-learning online optimization 
(MLOO). What distinguishes our approach from previ¬ 
ous methods for automation is that we seek to develop a 
statistical model of the relationship between parameters 
and the outcome of the experiment. We demonstrate that 
MLOO can discover condensation with less experiments 
than a competing optimization method and provide in¬ 
sight into which parameters are important in achieving 
condensation. 


Online optimization (00), with mostly genetic [9] 20] 
but also gradient [21] and hybrid solvers [22] [23], has 
been used to enhance a variety of quantum experiments. 
Here, online means optimization is done in real time with 
the experiment. We apply OO based on machine learn¬ 
ing. What distinguishes our approach is that it does not 
only seek to optimize the experiment, but also creates an 
internal model that is able to predict the performance of 
future experiments given any set of parameters. This is 
achieved by modeling the experiment using a Gaussian 
process (GP) [24]. Online machine learning (OML) with 
GPs j24l[28] has been applied in a variety of areas in¬ 
cluding robotics H3H3 , vision industrial chemistry 

|32j 33 and biochemistry [34]. However, in all of these 
cases, the focus was not on optimization. Rather, the 
goal was the development of an accurate model. We here 
combine the advantages of OML with the motivation of 
OO. The resultant MLOO algorithm has the following 
beneficial features: every experimental observation from 
an optimization run is used to improve the GP model, 
and uncertainties in the measurements can be correctly 
accounted for; our algorithm is both deterministic and 
able to find global minima, instead of randomly explor¬ 
ing the parameter space our learner uses its estimate of 
the variance from the GP to pick parameter values where 
we are most uncertain about the value of the cost func¬ 
tion; fast gradient-based optimization routines can not 
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Figure 1. The experiment and the ‘learner’ see each other as a black box. The learner produces a parameter set, X , for the 
experiment to test, these are converted into cooling ramps and used to perform an experiment. After the evaporation process 
is finished, an image of the cold atoms taken is used to calculate a cost function based on its quality as a resource C(X). C(X) 
is then fed back to the learner. The learner uses a GP to model the relationship between the input parameter values and the 
cost function values produced by the experiment. This model depends on a set of correlation lengths, or hyperparameters. Part 
(a) of the figure plots a set of observed costs (black circles with bars for uncertainty) with three possible GP models fit to the 
data: one with a long correlation length (red dotted), a medium correlation (blue solid) and a short correlation length (green 
dashed). Each GP is illustrated by a mean cost function bracketed by two curves indicating the function +/- one standard 
deviation from the mean. The correlations lengths affect both the mean and uncertainty of the model; note that the uncertainty 
approaches zero near the observed points. A final cost function is then produced as a weighted average over the most likely 
correlation lengths. This predicted model is then used to pick the next parameter set to test X , which is fed to the machine. 


be applied directly to our experimental data as it is too 
noisy, but they can be employed to efficiently find optima 
in our smooth machine-generated model; estimation of 
parameter sensitivity and visualization of the functional 
dependence of the resource’s quality can inform exper¬ 
imentalists on how to best develop future optimization 
experiments. 

The experimental apparatus is described in detail in 
[35] . Initially 87 Rb atoms are cooled in a combined 2D 
and 3D MOT system and subsequently cooled further 
by RF (radio frequency) evaporation. The cloud is then 
loaded into a cross beam optical dipole trap for the final 
evaporation stage. It is this stage that is the subject of 
the optimization process. The cross dipole trap is formed 
from two intersecting 1090nm and 1064nm lasers with ap¬ 
proximate waists of 100/im and 60/im respectively. The 
depth of the cross trap is determined by the intensity 
of the two beams. The 1064nm beam is controlled by 
varying the current to the laser, while the 1090nm beam 
is controlled using the current and a waveplate rota¬ 
tion stage combined with a polarizing beamsplitter to 
provide additional power attenuation while maintaining 
mode stability. A diagram of the experimental set up is 
shown in figure 1. Normally the power to these beams 
is ramped down over time, thereby lowering the walls of 
the trap and allowing the higher energy atoms to leak 
out. The remaining atoms rethermalize to a lower tem¬ 
perature, enabling cooling. Once the gas has been cooled 
to temperatures on the order of nK, a phase transition 
occurs, and a macroscopic number of atoms start to oc¬ 
cupy the same quantum state. This transition is called 
Bose-Einstein condensation [36] . We hand over control 
of these ramps to the MLOO. We consider two parame- 


terizations: one simple, where we only control the start 
and end points of a linear interpolation; and one com¬ 
plex, where we add variable quadratic, cubic and quartic 
corrections to the simple case (see Appendix). 

The approach we propose is a form of supervised learn¬ 
ing, meaning that we provide the learner with a number 
that quantifies the quality of the resource produced or in 
optimization terminology a cost that must be minimized. 
Naively one might try to use a measure based on temper¬ 
ature and particle number. However determining these 
quantities accurately near condensation is difficult when 
constrained to very few runs per parameter set. Instead, 
a semi-heuristic method is used to calculate the cost. An 
absorption image of the final state of the quantum gas 
is taken after a 30ms expansion of the cloud, with the 
image providing the optical depth as a function of space. 
The cost is calculated from all data between a lower and 
upper threshold optical depth. The lower threshold is de¬ 
termined by the noise in the system. The upper threshold 
is required since the absorption images are taken on reso¬ 
nance and hence saturate for areas with very large optical 
depth. The upper threshold is set slightly below this sat¬ 
uration level. Only data from between the bounds is used 
and cost is simply the average of these values. In practice 
this means the sharper the edges of the cloud, the lower 
the cost. Indeed, low quality thermal clouds have broad 
edges, whereas the ideal BEC has much sharper edges. 
Each parameter set is tested twice with the average of 
the two runs used for the cost. Tests of the variation 
in cost for a set of parameters run-to-run indicate they 
obey a Gaussian distribution. As such we are able to es¬ 
timate the uncertainty from two runs as twice the range. 
In doing so, the chance we have underestimated the un- 





































3 


certainty will be 27%. We therefore also apply bounds to 
the uncertainty to eliminate outliers overly affecting the 
modeling process. The cost function can be evaluated as 
long as some atoms are present at the end of the evap¬ 
oration run. In cases where the evaporation parameters 
produced no cloud twice for a set of parameters, we set 
the cost to a default high value. 

We treat the experiment as a stochastic process ^(X) 
which is dependent on the parameters X = (aq, • • • , %). 
When we make a measurement and determine a cost, 
we interpret this as a sample of this process C(X) with 
some associated uncertainty U(X). We define the set 
of all parameters, costs and uncertainty previously mea¬ 
sured as X = (Xi, • • • Wv), C = (Ci,--- ,CW) and 
U = (Ci, • • • , Un) respectively and collectively refer to 
these sets as our observations O = (X,C,U). The aim of 
00 is to use previous observations O to plan future ex¬ 
periments in order to find a set of parameters that min¬ 
imize the mean cost of the stochastic process M<g(X). 
Unique to the MLOO approach, we first make an es¬ 
timate of the stochastic process given our observations 
^(X\ 0), which is then used to determine what parame¬ 
ters to try next. 

We model ^(X) as a GP — a distribution over 
functions — with constant mean function and covari¬ 
ance defined by a squared exponential correlation func- 

j=1 ( xj - xj ) /h j where H = 

(hi, • • • , Hm) is a set of correlations lengths for each of 
the parameters. The mean function conditional on the 
observations O and correlations lengths H of our GP is: 
Hcg{X\ 0, H), which is evaluated through a set of matrix 
operations [24 (see Appendix). As we are using a GP, we 
can also get the variance of the functions conditioned on 
O and H\ a^(X\0, H) [24]. Both of these estimates de¬ 
pend on the correlation lengths H, normally referred to 
as the hyperparameters of our estimate. We assume that 
H is not known a priori and needs to be fitted online. 

The correlation lengths H control the sensitivity of the 
model to each of the parameters, and relates to how much 
a parameter needs to be changed before it has a signifi¬ 
cant effect on the cost (see Fig. 1). A standard approach 
to fit H is maximum likelihood estimation m■ Here, the 
hyper parameters are globally optimized over the likeli¬ 
hood of the parameters H given our observations 0, or 
L(H\0) [24] (see Appendix). However, when the data set 
is small there will often be multiple local optima for the 
hyperparameters whose likelihoods are comparable to the 
maximum. We term these hyperparameters the hypoth¬ 
esis set H = (Hi, • • • ,Hp) with corresponding likelihood 
set C = (Li, • • • ,Lp). 

To produce our final estimates for the mean func¬ 
tion and variance we treat each hypothesis as a parti¬ 
cle [28] . and perform a weighted average over H. The 
weighted mean function is now defined as M^(X\0, H) = 
Y^ = i w il J 'cg(X\0,Hi) and weighted variance of the 



Figure 2. The optimization of the evaporation stage of cre¬ 
ating a BEC using the complex 16 parameter scheme. The 
first 20 evaluations are an initial training run using a sim¬ 
ple Nelder-Mead algorithm. The machine learning algorithm 
(green) then quickly optimizes to BEC. The insets show the 
different regimes that the experiment goes through, from a 
large, completely thermal cloud through to the sharp edged 
BEC. 

functions is Y^{X\0,U) = Ef=i H t ) + 

/4(X| Q,Hi)) - M£(X\0,H), where w t = E/Ef= xU 
are the relative weights for the hyperparameters. Now 
that we have our final estimate for C £(X\0,T-L), we need 
to determine an optimization strategy for picking the next 
set of parameters to test. 

Consider the following homogeneous strategies: we 
could test parameters that minimize M^(X), but this 
strategy can get trapped in local minima; or we could 
test parameters that maximize SA(X) (i.e. where we 
are most uncertain), but this may require a large num¬ 
ber of trials to map the space and would not prioritize 
refinement of the minima. We chose to implement an 
inhomogeneous strategy that repeatably sweeps between 
these two extremes by minimizing a biased cost func¬ 
tion: B^(X) = bM^(X) — (1 — b)E^(X), where the 
value for b is linearly increased from 0 to 1 in a cycle of 
length Q. This makes the learner change strategy from 
getting maximum information (b = 0), to looking for a 
new minima: with a high risk-seeking (b small) to risk- 
neutral (b = 1) preference. During testing with synthetic 
data, we found sweeping was more robust and efficient 
than the homogeneous approaches. When we minimized 
B^(X) we also put bounds, set to 20% of the parameters 
maximum-minimum values, on the search relative to the 
last best measured X. We call these bounds a leash , as 
it restricts how fast the learner could change the param¬ 
eters but did not stop it from exploring the full space 
(similar to trust-regions 37, 381). This was a technical 
requirement for our experiment: when a set of parame¬ 
ters was tested that was very different from the last set, 
the experiment almost always produced no atoms, mean¬ 
ing we had to assign a default cost that did not provide 
meaningful gradient information to the learner. Once the 
next set of parameters is determined they are sent to the 
experiment to be tested. After the resultant cost is mea¬ 
sured this is then added to the observation set 0 with 
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Figure 3. Optimization of evaporation curves to produce a 
BEC. The first 2N evaluations use a simple Nelder-Mead al¬ 
gorithm to learn about the cost space. The machine learning 
algorithm (red and blue) optimizes to BEC faster than the 
Nelder-Mead (black). By utilizing the machine learning model 
a parameter is eliminated and the convergence improves (red). 


N N + 1 and the entire process is repeated. 

We emphasize that fitting of H , estimation of M^(X) 
and minimization of B^(X) is all done online while the 
experiment is being run. In a single optimization run, the 
learner typically performs hundreds more hypothetical 
experiments than the number physically run in the lab. 
The MLOO algorithm we developed is open source and 
available online at [32] (it uses the package scikit-learn 
m to evaluate the GPs). 

As a benchmark for comparison, we also performed 00 
using a Nelder-Mead solver m, which has previously 
been used to optimize quantum gates [23]. 

We demonstrate the performance of machine learning 
online optimization in comparison to the Nelder-Mead 
optimizer in Fig. 2. Here we used the complex parame¬ 
terization for all 3 ramps, and added an extra parameter 
that controlled the total time of the ramps, resulting in 
16 parameters. If we were to perform a brute force search 
and optimize the parameters to within a 10% accuracy of 
the parameters maximum-minimum bounds, the number 
of runs required would be 10 16 . The Nelder-Mead algo¬ 
rithm is able to find BEC much faster than this, in only 
145 runs. The machine learning algorithm, on the other 
hand, is much faster. After the first 20 training runs, 
where the machine learning and Nelder-Mead algorithm 
use a common set of parameters, the machine learning 
algorithm converges in only 10 experiments. 

The learner used in Fig. 2 only used the best hypothe¬ 
sis set when picking the next parameters, in other words 
we set P = 1. Evaluating multiple GPs is computation¬ 
ally expensive with so many parameters, so to save time 
we made this restriction. In spite of this, the learner dis¬ 
covered ramps that produced BEC in very few iterations. 
This is because the learner consistently fitted the corre¬ 
lation lengths of the 3 most important parameters — the 
end points of the ramps — very quickly. However, we 
found the other correlations lengths were not estimated 
well and would not converge, even after a BEC was found. 
This meant that we were unable to make useful predic¬ 
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Figure 4. Plots of the cross sections through the minima of 
the cost landscape as predicted by the learner. In (a) the 
predicted cost is shown as a function of the end of the polar¬ 
ization ramp (red), the end of the dipole beam ramp (green) 
and the unconnected parameter (blue). The learner correctly 
identifies that the unconnected parameter does not have a 
significant effect on the production of BEC. In (b) a cross 
section of the 2 most sensitive parameters are plotted against 
cost. 


tions about the cost landscape and we could not reliably 
determine what parameters were least important. 

Gramacy et al. [28j have suggested that making good 
online estimation of the GP correlation lengths requires 
multiple particles. We considered achieving this goal in 
a different experiment as shown in Fig. 3. Here we used 
a learner with many particles P = 16, but had to use the 
simple parameterization for the ramps to save computa¬ 
tional time. This resulted in a total of 7 parameters. We 
can see again the overall trend for the machine learner is 
still faster than Nelder-Mead, but less pronounced. More 
carefully estimating the correlation lengths has hindered 
the convergence rate compared to the 16 parameter case. 
Nevertheless, as we now have a more reliable estimate of 
the correlation lengths we can take advantage of a differ¬ 
ent feature of the learner. 

In Fig. 4(a) we show estimates of the cost landscape as 
ID cross sections about the best measured point. We plot 
the two most sensitive parameters and the least. We can 
see the least sensitive parameter appears to have no effect 
on the production of a BEC. This parameter corresponds 
to an intentionally added 7th parameter of the system 
that controls nothing in the experiment. Fig. 4(a) shows 
the learner successfully identified this, even with such a 
small data set. After making this observation we can 
then reconsider the design of the optimization process 
and eliminate this parameter from the experiment. 

In Fig. 3 we plot the machine learner optimization run 
with P = 16 but now with only 6 parameters. We can 
see the learner converges much more rapidly than the 7 
parameter case, and even produces a higher quality BEC. 
As the learner no longer takes extra runs to determine 
the importance of the useless 7th parameter, it achieves 
BEC quite rapidly. Lastly, to give us some understanding 
of the complexity of the landscape, we plot a 2D cross 
section of the landscape against the two most sensitive 
parameters in Fig. 4(b) generated from the 6 parameter 
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machine learning run. We can see there is a very sharp 
transition to BEC, as it exists in a very deep valley of 
the landscape. 

We have demonstrated that ML00 can discover evap¬ 
oration ramps that produce high quality BECs with far 
fewer experiments than 00 with Nelder-Mead. Rapid 
optimization of ultra-cold-atom experiments is not just a 
useful tool to overcome technical difficulties, but will be 
vital in the application of BECs in proposed space-based 
scientific investigations [42,(43]. Furthermore, when im¬ 
plemented with many particles, the learner can be used to 
make estimates of the cost landscape to determine what 
parameters most contributed to BEC production and aid 
better experimental design. In future work we will ap¬ 
ply MLOO to atomic species with more exotic scatter¬ 
ing properties [8] in order to find novel cooling ramps 
that find optimal solutions in the competition between 
poorly characterized complex dynamical processes. Our 
approach is generic and available [39] for use on other sci¬ 
entific experiments, ultra-cold atom based or otherwise. 
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APPENDIX 

Gaussian process evaluation: In practice, evaluating a 
Gaussian process (GP) reduces to a set of matrix opera¬ 
tions whose derivation is given by Rasmussen et al. [24] in 
section 2.7. Consider N previous experiments have been 
performed with parameter sets X = (Xi ,---X/v) (each 
Xj = (xij, • • • xmj ))> measured costs C = (Ci, • • • , Cn) 
and uncertainties U = (Ui, • • • , Un)- We refer to the set 
of this data as our observations O = (X,C,U). We fit 
a GP to these observations with constant function offset 
/3 and covariance defined by a squared exponential cor- 

j = l( X 3,P~ X j,q) / h j 

where H = (hi, • • • , Hm) are the hyperparameters of the 
model. 

The mean function and variance of the functions are: 

ii ci {X\0,H)=p + r(X) T 1 (1) 

o%(X\0,H) =a 2 c (l-r(X) T R~ 1 r(X) 

+ (j T R- 1 j)- 1 (j T R~ 1 r(X)-lf) (2) 

where a\ is the variance of the costs C, and we define the 
constant (3 = (j T R~ x j)~ x j T R~ X Y, the Nx 1 vector r(X) 


such that {r(X)}i^ = K(X,Xi,H), the N x 1 vector 
7 = R~ 1 (Y — j/3), the N x 1 vector Y of the costs defined 
by {Y} m = Ci , the N x 1 vector {j}i,i = 1, the N x N 
matrix R defined as {R}ij = K(Xi, Xj , H) -\~SijUf , and 
where Sij is the Kronecker delta function. {-}ij is our 
notation for the ith row and j th column of a matrix or 
vector. 

When finding the most likely hyperparameters we 
maximize the likelihood function. The likelihood L(H\0) 
is defined as the probability of the costs given the param¬ 
eters, uncertainties and hyperparameters: P(C\X,U,H), 
the log of which is: 

logP(C\X,U,H) =l(-log|i?| - log fR-'j 

— (N- l)log2w-Y t (R~ 1 

- (j T R- 1 j)- 1 R- 1 jj T R- 1 )Y) (3) 

Parameterizations of evaporation ramps: The simple pa¬ 
rameterization of the evaporation ramps is 

Ks(yi,yf,tf) = yi + (yf-yi)f (4) 

where yi and yf specify the start and end amplitudes of 
the ramps and tf specifies the length in time. 

The complex parameterization an extension of the sim¬ 
ple form: 

n c (yi , y f , Ai , A 2 , A 3 ,tf) = 

yi + (yf -yi)f + a 2t (t - t f ) 
tf 

+ A 3 t (■ t - t f ) (t + 

+ A 4 t (t - tf) (t + -tfj (t + -tfj (5) 

where A i, A 2 and A 3 correspond to the 3rd, 4th and 
5th order polynomial terms respectively with each poly¬ 
nomial having evenly spaced roots between t = 0 and 
t = tf. As with the simple parametrization tf specifies 
the end of the ramps in time. 

In each of the three ramps being optimized, the pa¬ 
rameters yi , yf , A 1 , A 2 , A 3 are independent. However, 
the final time tf is common. 


